Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2)#19706

Open
Manogna-Sree wants to merge 3 commits into ggml-org:master from Manogna-Sree:Q6_K_blockinterleaving_implementation
Conversation

@Manogna-Sree
Contributor

PR #15275, which we submitted previously, included repacking and block interleaving for the Q6_K nodes. However, on the latest master we observed inaccuracies related to the usage of the repacked scales present in the master branch.

This PR fixes those inaccuracy issues and adds a block interleaving approach for Q6_K quantization on the x64/x86 SIMD architecture.

- Initial gains were observed in prompt processing with the above changes compared to the baseline Q6_K model.
- The GEMM function is implemented for AVX512/AVX2, and the GEMV function is implemented for the AVX2 architecture.
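For context, the core idea behind block interleaving is to repack the quantized data of several consecutive rows so that one SIMD load fetches corresponding bytes from all rows at once. The sketch below is a simplified illustration of that repacking, not the actual Q6_K kernel from this PR: the block size, chunk width, and function name (`interleave_blocks`) are illustrative assumptions (real Q6_K blocks hold 256 quants plus scales).

```c
#include <stdint.h>

#define QK    16   /* simplified block size (real Q6_K uses 256 quants) */
#define NROWS 4    /* number of rows interleaved together */
#define CHUNK 8    /* bytes taken from each row per interleave step */

/* Repack the quant bytes of NROWS row-major blocks so that CHUNK-byte
 * runs from each row alternate in the output. A single wide SIMD load
 * on the repacked buffer then covers all NROWS rows simultaneously,
 * which is what enables the GEMM speedups for prompt processing. */
static void interleave_blocks(const uint8_t src[NROWS][QK],
                              uint8_t dst[NROWS * QK]) {
    int out = 0;
    for (int off = 0; off < QK; off += CHUNK) {
        for (int r = 0; r < NROWS; r++) {
            for (int b = 0; b < CHUNK; b++) {
                dst[out++] = src[r][off + b];
            }
        }
    }
}
```

With this layout, dst holds row 0's first 8 bytes, then row 1's first 8 bytes, and so on, before moving to each row's second 8-byte chunk.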

| Model | Size | Params | Backend | Threads | Test | t/s (mean ± std) | Speedup | Commit id |
|---|---|---|---|---|---|---|---|---|
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 40.02 ± 0.06 | — | e9a859d - Base Commit |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 46.72 ± 0.08 | 16.79% | a3020c0 - AVX2 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | pp 512 | 61.61 ± 0.13 | 54% | a3020c0 - AVX512 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.41 ± 0.00 | — | e9a859d - Base Commit |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.11 ± 0.00 | -2.88% | a3020c0 - AVX2 Version |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | CPU | 6 | tg 128 | 10.104 ± 0.00 | -2.94% | a3020c0 - AVX512 Version |

GCC Version = 12.3

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Additionally, the PR was tested for execution with clang on both Linux and Windows.

The perplexity results with llama2 7B are tabulated as follows:

| Model | Perplexity (Final estimate PPL) | Commit id |
|---|---|---|
| llama 7B Q6_K | 5.8164 ± 0.03250 | e9a859d - Base Commit |
| llama 7B Q6_K | 5.8163 ± 0.03250 | a3020c0 - Updated Commit |

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Feb 18, 2026